September 13, 2004

Using RFC3229 with Feeds.

[Updated: 22-Sep-2004 14:33 with link to list of implementations]

The other day I wrote that we really should be adopting RFC3229 "Delta Encoding in HTTP" in order to reduce the amount of bandwidth, etc. that is wasted in serving RSS and Atom files. I'm fairly convinced that if the folk at Microsoft had been using what I propose here, they would not have been forced to take the drastic measures that they did when they did.

Of course, use of RFC3229 would have only delayed, not eliminated the day when the current practice of polling for updates to RSS files would have become excessively expensive for Microsoft. The real solution to the bandwidth problem is to move from polling to a push-based solution. But, at least by implementing RFC3229, we can take the polling solution just about as far as it can be taken -- in terms of efficiency... This is a good intermediate step on the way to the push-based solutions that we won't have much choice but to implement as the audience for RSS and Atom data grows.

This post is intended to provide additional detail on what I'm proposing. It is my intention to create an Internet Draft describing the ideas here once I've had reasonable time to receive comments from folk and work out the inevitable bugs. Please feel free to comment on what is below:

Feeds aren't like HTML

Atom and RSS files are members of a distinct class of files that we call "feeds." Conceptually, a "feed" is a potentially infinitely long and growing series of items or entries. In order to reduce the cost of distributing new entries, RSS and Atom feeds are typically implemented as "sliding window" feeds. Such feeds don't contain every entry that has ever been published. Rather, they only contain some number of the most recent changes to the feed.

Common practice today is for feed providers to establish a certain fixed "window size" which defines the maximum number of entries that are contained in any instance of the feed file. Once the maximum number of entries has been reached, then every retrieval of the feed from then on will always receive that number of entries -- even if some smaller number of entries has been inserted into the feed since the last time the feed was retrieved by a specific client. The result is a great deal of wasted bandwidth and processing resource.

In order to allow the number of entries returned in a feed to be no more than the total number of new or modified entries inserted into the feed since the last time any specific client retrieved the feed, I propose that we rely on RFC3229 "Delta encoding in HTTP" with a new instance-manipulation method defined to provide feed-specific delta encoding.

The "feed" instance-manipulation method

The "feed" instance manipulation method is an abstract method for which concrete forms can be defined for use with various content types. In this document, I'll be speaking primarily about the use of Feed IM as appropriate for Atom files.

Unlike the IM methods currently registered for use with RFC3229, the "feed" IM method is not byte-oriented. Rather, it is item or entry oriented in that deltas are computed not in bytes but rather in whole items or entries. The definition of those items or entries is dependent on the underlying content type. For instance, if the content type is RSS, the delta unit is an "item". If the content type is Atom, the delta unit is an "entry". If the content type is "log file" then the delta unit is a "line".

When the "feed" IM method is applied to an instance, the result should conform to the syntactic requirements of the instance's type. Thus, if the instance is Atom formatted, the result of applying the "feed" IM method would be Atom-conformant. This implies that the result would have atom:entry elements wrapped in an atom:feed element that contains an atom:head element.

The detailed rules for applying "feed" instance-manipulation for various types should be easily derived from what is said above.
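
As a minimal sketch of the entry-oriented delta described above, the following Python assumes that each entry carries a server-side counter that is bumped whenever the entry is inserted or modified, and that the server remembers which counter value each entity tag corresponds to. The names `Entry`, `seq`, `etag_to_seq` and `feed_delta` are hypothetical and chosen only for illustration; nothing here is defined by RFC3229 or Atom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    id: str     # stable entry identifier
    seq: int    # hypothetical server-side counter, bumped on insert or modify
    title: str


def feed_delta(entries, etag_to_seq, client_etag):
    """Return the entries that are new or modified since client_etag.

    Falls back to the full list when the entity tag is unknown.
    """
    since = etag_to_seq.get(client_etag)
    if since is None:
        return list(entries)
    return [e for e in entries if e.seq > since]


# Newest first, as in a typical feed document.
entries = [
    Entry("e3", 5, "Third post"),
    Entry("e2", 4, "Second post (edited)"),
    Entry("e1", 1, "First post"),
]
etag_to_seq = {'"321"': 3}  # state of the feed when "321" was current

delta = feed_delta(entries, etag_to_seq, '"321"')
print([e.id for e in delta])
```

The delta is a list of whole entries, never a byte range, so it can be wrapped in the usual feed envelope and still be a valid document of the same type.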

Because applying the "feed" IM method yields a result of the same type as the instance, an interesting opportunity arises. Byte-oriented IM methods must never be used unless a client specifically requests them, since not every client may support the result or have the history needed to interpret it. That requirement need not apply to the feed IM method. Thus, servers that implement this method are free to apply it by default even if it is not requested.

feed: A worked example

The following shows what an RFC3229-compliant request to a server might look like:

GET /atom.xml HTTP/1.1
Host: bar.example.net
If-None-Match: "321"
A-IM: feed, gzip
  • The client wants to obtain the current value of /atom.xml
  • It has previously received an instance whose entity tag is "321"
  • It is willing to accept delta-encoded updates using the "feed" IM method. (Note: It is not strictly necessary for the client to request the "feed" IM method in all cases since some servers may actually apply this method by default. Nonetheless, it is good form to request it since some servers may not use the method unless it is requested.)
  • It is willing to accept responses that have been compressed using "gzip," whether or not these responses have been delta-encoded.
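
A client might build the request above roughly as follows. This is only a sketch using Python's standard `urllib.request`; the helper name `build_feed_request` is hypothetical, and the request is constructed but never sent.

```python
import urllib.request


def build_feed_request(url, etag=None):
    """Build (but do not send) a conditional GET that advertises the "feed" IM method."""
    headers = {"A-IM": "feed, gzip"}
    if etag is not None:
        # Entity tag of the instance the client already holds.
        headers["If-None-Match"] = etag
    return urllib.request.Request(url, headers=headers)


req = build_feed_request("http://bar.example.net/atom.xml", etag='"321"')

# urllib stores header names with only the first letter capitalized
# ("A-im", "If-none-match"); HTTP header names are case-insensitive.
for name, value in sorted(req.header_items()):
    print(f"{name}: {value}")
```

On a first fetch, when no entity tag is known yet, the same helper simply omits If-None-Match and the server returns the full feed.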

If, when this request is received, the server's current entity tag for the resource is still "321," then the server should simply return a 304 (Not Modified) response, as would a traditional server.

If the entity tag has changed, the server could compute the delta between the entity whose entity tag was "321" and the current instance. If the server no longer knows what the "321" entity tag corresponds to, it would probably send the entire feed.

If the client requests delta-encoding but the server doesn't support this form of instance manipulation, the server will simply ignore this aspect of the request.
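
The three outcomes described above (304, a delta, or the full feed) can be sketched as a single decision function. This is an illustration under stated assumptions, not a reference implementation: the `history` mapping from old entity tags to the entry ids the feed contained at that time is hypothetical bookkeeping, the sketch only detects new entries (not modified ones), it ignores any q-values in A-IM, and it applies "feed" only when the client asks for it.

```python
def respond(request_headers, current_etag, current_entries, history):
    """Decide between 304, 226 and 200 for a feed request.

    history maps an older entity tag to the set of entry ids that the
    feed contained when that tag was current (hypothetical bookkeeping).
    Returns (status, headers, entries).
    """
    client_etag = request_headers.get("If-None-Match")
    accepted = [m.strip() for m in request_headers.get("A-IM", "").split(",")]

    # Nothing changed since the client's copy.
    if client_etag == current_etag:
        return 304, {"ETag": current_etag}, []

    # The client asked for "feed" and we still know what its tag referred to.
    if "feed" in accepted and client_etag in history:
        seen = history[client_etag]
        delta = [e for e in current_entries if e["id"] not in seen]
        headers = {"ETag": current_etag, "IM": "feed",
                   "Cache-Control": "no-store, im"}
        return 226, headers, delta

    # Unknown tag, or no "feed" in A-IM: send the entire feed.
    return 200, {"ETag": current_etag}, list(current_entries)


entries = [{"id": "e3"}, {"id": "e2"}, {"id": "e1"}]
history = {'"321"': {"e1", "e2"}}

status, headers, body = respond(
    {"If-None-Match": '"321"', "A-IM": "feed, gzip"}, '"4321"', entries, history)
print(status, [e["id"] for e in body])
```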

If the server responds with a delta encoded response, it would look something like this:

HTTP/1.1 226 IM Used
ETag: "4321"
IM: feed, gzip
Date: Mon, 13 Sep 2004 18:30:05 GMT
Cache-Control: no-store, im
...
  • The response status is 226 IM Used -- a success code.
  • The entity tag given is that of the new state of the resource.
  • The response carries an "IM" response-header field, indicating which delta encoding is used in the response.
  • The Cache-Control "no-store" directive is used to ensure that caches that do not understand delta-encoding do not cache this response. However, a cache that does understand the use of instance-manipulation is allowed to ignore the "no-store" directive, which would otherwise be mandatory.
  • The message-body is first delta-encoded using the feed IM method appropriate for the type of feed and is then gzipped.
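
On the receiving side, the client reverses those steps: it undoes gzip first and then merges the delivered entries into what it already holds. The sketch below assumes a body that is always gzipped and uses a simplified, un-namespaced `<feed>`/`<entry>` document as a stand-in for real Atom; the function name `apply_response` and the id-to-title cache are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def apply_response(status, body, cached):
    """Merge a feed response into `cached`, a dict of entry id -> title.

    226: body holds only new or changed entries, so update the cache.
    200: body is the full feed, so replace the cache.
    304: nothing changed.
    """
    if status == 304:
        return cached
    root = ET.fromstring(gzip.decompress(body))  # undo gzip first, then parse
    received = {e.findtext("id"): e.findtext("title") for e in root.iter("entry")}
    if status == 226:
        merged = dict(cached)
        merged.update(received)  # changed entries overwrite older copies
        return merged
    return received


delta_xml = b"<feed><entry><id>e3</id><title>Third post</title></entry></feed>"
cached = {"e1": "First post", "e2": "Second post"}

updated = apply_response(226, gzip.compress(delta_xml), cached)
print(sorted(updated))
```

Because the delta is itself a well-formed feed, a client that knows nothing about the "feed" IM method could parse the same body as an ordinary, shorter feed.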

For a list of implementations of RFC3229 with the "feed" IM method, click here. For statistics showing the savings that have resulted from early implementations, click here.

f-range: A feed oriented Range

Just as it is appropriate to define a feed-specific delta IM method, it is appropriate to provide a feed-specific IM method for range selection. RFC3229 currently only supports byte-oriented range selection.

The f-range IM method uses the content type's concept of item or entry as its unit of selection. Thus, "F-range: entries=1-20" would specify that the client only wants to receive a maximum of 20 items starting at the "first" or most recent item in the feed. The specification "F-range: entries=20-" would indicate that all items, starting at the 20th oldest, should be returned in the result. All item offsets should be computed based on the state of the feed associated with the entity tag passed in the request, or with the If-F-Range if one is provided. Thus, it is possible for resource-limited clients to "chunk" their way through a large number of available items in a fast-moving feed.

When responding to an F-range request, the response should contain the entity tag associated with the feed at the time of the response and the cache control statements should be set to prevent caching.
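
A rough sketch of how a server might parse and apply an F-range value follows. It rests on an interpretation that the text above does not fully pin down: positions are taken to be 1-based and counted from the most recent entry, and an open-ended value such as "entries=20-" is read as "from position 20 to the end of the list". The function names are hypothetical.

```python
import re


def parse_f_range(value):
    """Parse 'entries=1-20' or 'entries=20-' into (first, last); last may be None."""
    m = re.fullmatch(r"\s*entries=(\d+)-(\d*)\s*", value)
    if not m:
        raise ValueError(f"unsupported F-range value: {value!r}")
    first = int(m.group(1))
    last = int(m.group(2)) if m.group(2) else None
    if first < 1 or (last is not None and last < first):
        raise ValueError(f"invalid F-range bounds: {value!r}")
    return first, last


def apply_f_range(entries, value):
    """Select entries by 1-based position; `entries` is ordered newest first (assumed)."""
    first, last = parse_f_range(value)
    return entries[first - 1:last]


# Thirty entries, newest first: e30, e29, ..., e1.
entries = [f"e{i}" for i in range(30, 0, -1)]

first_chunk = apply_f_range(entries, "entries=1-20")
rest = apply_f_range(entries, "entries=20-")
print(len(first_chunk), first_chunk[0], len(rest), rest[0])
```

In a full request the range would be applied after the "feed" delta has been computed, so that the positions refer to the set of entries selected for that client.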

f-range: A worked example

GET /atom.xml HTTP/1.1
Host: bar.example.net
If-None-Match: "321"
A-IM: feed, f-range, gzip
F-range: entries=1-20
  • This request asks first for all entries added since the entity tag "321"
  • The set of items in the response is limited to the most recent 20
  • The response should be gzipped.
HTTP/1.1 226 IM Used
ETag: "4321"
IM: feed, f-range, gzip
Date: Mon, 13 Sep 2004 18:30:05 GMT
Cache-Control: no-store
...

Benefits of the approach

Implementing and deploying the "feed" IM method will provide the same general benefits as are provided by the various byte-oriented IM methods of RFC3229. These benefits are:

  • A reduction in the mean size of HTTP responses, thereby improving latency and network utilization. For actual numbers which show savings from early implementations of RFC3229+feed, click here.
  • Avoidance of any extra network round trips
  • Minimization of per-request and per-response overheads.
  • Support for a variety of encoding algorithms and formats.
  • Interoperation with HTTP/1.0 and HTTP/1.1.
  • Fully optional for clients, proxies, and servers.
  • Moderately simple implementations are possible.

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/t/trackback/261/1119291

Listed below are links to weblogs that reference Using RFC3229 with Feeds.:

» RFC3229 for partial feed retrieval from Niall Kennedy's Weblog
Bob Wyman, CTO of PubSub, details how RFC3229, "Delta Encoding in HTTP," could be used to help solve the bandwidth problem of syndicated feeds. In order to allow the number of entries returned in a feed to be no more... [Read More]

» RFC3229 enabled from Sam Ruby
Experimental support for RFC 3229 "feed" instance manipulation method: test cache.py has the delta function. [Read More]

» RFC3229 Feed Instante Maniuplation from PixelCort
The adoption of a new Instance Manipulation, as expressed in [RFC3229 with Feeds](http://bobwyman.pubsub.com/main/2004/09/using_rfc3229_w.html), seems to be unnecessary. Why can't the existing diff algorithms work? Why must we have yet another one,... [Read More]

» A-IM: feed support from Sam Ruby
User agents of clients that provide support for the RFC 3229 "feed" instance manipulation method: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.2) Gecko/20040803 FeedDemon/1.2 Beta 1 (http://www.bradsoft.com/; Microsoft Windows XP) Mozilla/5. [Read More]

» Manipulating Feeds from Windley's Enterprise Computing Weblog
Have you ever interrupted an HTTP download and then restarted it later and had it pick up where it left off? That little bit of magic is the result of RFC3229: Delta encoding in HTTP . [Read More]

» FeedDiff for Roller from Sam Ruby
Yesterday, I had lunch with Dave Johnson. He asked me how hard would it be to add support for the RFC 3229 "feed" instance manipulation method to Roller. I said that I would take a look into it. As an aside, I find designing caching logic some of th [Read More]

» RSS Bandit v1.3.0.26 Released from Dare Obasanjo aka Carnage4Life
[Read More]

» REST vs API from Sam Ruby
I spent four of the last six work days in all day meetings. While the meetings were about other things (primarily GlueCode and Zend related, in case you were wondering), I got to see that the basic fundamentals of REST are still widely misunderstood. [Read More]

» Reducing RSS overhead? from Tom Servo's Blogogogogog
Referred on the SubText site was a blog entry about how to reduce the overhead of RSS feeds. Overhead... [Read More]

» So, What's New? from franklinmint.fm
Ok, so the draft-sayre-atompub-protocol-basic-02 is pretty close to matching the current WG draft in capability. The question is whether that's... [Read More]

» Vista and RFC 3229 from Cook Computing
Following my previous post about Vista RSS platform support for ttl, skipHours, and skipDays, Josh Christie emailed me that the platform will also support RFC 3229 - Delta Encoding in HTTP (see here on the IEBlog). The RFC specifies... [Read More]

Comments

Why not use Vary to indicate the additional headers which provide a complete caching context?

With respect to Range, I've just written up the use of this HTTP header for addressing and direct manipulation of subresources using the XPointer Framework for an extensible addressing platform. Since the conference (Extreme Markup 2004) has not yet (!?!) published the paper online, I'll send you a copy in email.

By the way, I did a similar thing with REST-ful queues where the parameters were in the query string. Worked fine. I used it to back an RSS service so that you could get slices of the RSS feed based on any indexed properties from the queue entries. I've been meaning to revisit this using the Range header.

-bryan

James Robinson reports that he has implemented diffe-based RFC3229 support for WordPress. Hopefully, he'll support the "feed" IM method as well... see:
http://www.robinsonhouse.com/2004/09/14/rss-and-delta-encoding/

bob wyman

Bob - I just implemented RFC3229 for ExpressionEngine this morning allowing both diffe and feed, and I am wondering if the Apache support is really the only thing holding this back now?

People here should be aware that ICE (Information & Content Exchange) is an XML-based syndication standard that has been providing incremental updates since 1998 or so. There are some real limitations of presenting "deltas" at the HTTP layer rather than the application layer, because (IMO) the semantics are more appropriate. That is, HTTP offsets are really intended to be byte offsets into the content of the resource being delivered via HTTP, while at the application (ICE) level it makes sense to have instructions such as "here's a new version of story 123" or "delete story 456, and add story 789".

For info on ICE2, see http://www.icestandard.org.

Is there any way to indicate that an entry has been deleted? Would there need to be?

But hasn't the bandwidth problem already been solved in HTTP? There are many headers in the HTTP specification that help with this. Speaking as a blog host who serves gigabytes of RSS feeds each month, I am amazed at how many hosts and feed readers do not even attempt to present Last-Modified or ETag data to determine whether they already have the latest information. If they did at least this, the amount of bandwidth consumed would drop like a stone overnight.

So while I applaud this effort, I can't see it solving the problem. It is just another layer of administration that feed readers will not implement. The cause is laziness and an inability to appreciate that HTTP has already answered many of the problems facing our RSS world today. Previously it was easy: it was largely down to the browser engineers, and since there was a finite number of them, they all adhered to the standard. However, since there are so many different RSS readers out there, including all the RSS developer APIs for reading and parsing feeds, no one is bothering to think about the poor HTTP protocol wire it is eventually going out on. So when a developer creates a new RSS reader, it is not his own bandwidth being consumed, but that of the countless clients of his creation. Why should he be worried about the bandwidth?

Therefore, instead of actually using the tools and headers available, we find ourselves having to create more standards to solve the problem. I don't think it's the way to go.

Alan, certainly we would all be better off if clients actually implemented conditional GETs with "If-Modified-Since" or ETags. However, even if they did, we would still be wasting bandwidth in the case where some of a feed has changed but not all of it. In terms of priority, supporting conditional GETs is certainly higher than RFC3229+feed. However, once you've provided support for conditional GETs, the incremental effort to support RFC3229+feed is trivial while the benefits are great. As I've documented elsewhere on this blog, we saw a massive drop in bandwidth needs the moment we implemented RFC3229+feed ourselves. This is because many of the more popular feed readers *have* built in support for RFC3229+feed. If you were to implement this at blog-city, I think you would also find massive improvements -- even though there are still many clients that don't support it.

bob wyman

Great site, well done. I enjoy beeing here and i´ll come back soon. You do a great job. Many greetings.

Great article and well written.

You are right. I lose a lot of bandwidth to feeds.

Hi,

Can I use delta-encoding in cache-nocache multipart messages?

How do I apply delta-encoding inside an HTML page?

Thanks a lot
